Handwritten Digit Recognition with the KNN Algorithm

y hat · Published 2021-01-08 22:23:17

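1. Text-file data: each sample is a text file of 0 and 1 characters (32 lines of 32 characters), named like 0_0.txt, where the part before the underscore is the digit label.
2. Convert each 32×32 binary image into a 1×1024 vector.
3. Test the algorithm.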
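
Step 2 can be sketched on a small in-memory example. This is a minimal illustration, not the author's code: the helper name `rows_to_vector` and the 4×4 toy image are assumptions made for brevity, while real samples are 32×32 and yield a 1×1024 vector. It shows that pixel (i, j) of an image with m columns lands at index m*i + j of the flattened vector.

```python
import numpy as np


def rows_to_vector(rows):
    """Flatten equal-length strings of '0'/'1' characters into a 1 x (n*m) vector.

    Hypothetical helper for illustration only.
    """
    grid = np.array([[int(ch) for ch in row] for row in rows], dtype=float)
    return grid.reshape(1, -1)  # row-major: pixel (i, j) -> index m*i + j


toy_image = ["0110", "1001", "1001", "0110"]  # assumed 4x4 stand-in for a 32x32 digit
vec = rows_to_vector(toy_image)
print(vec.shape)  # (1, 16)
```

The script below does the same thing with explicit loops and the index 32*i+j.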

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File   ：KNN -> kNN
@IDE    ：PyCharm
@Author ：zgq
@Date   ：2021/1/7 14:15
@Desc   ：
=================================================='''
from numpy import tile, zeros
import operator  # provides itemgetter, used to sort the vote counts
from os import listdir

def classify0(inX, dataSet, labels, k):
    # inX: input vector to classify
    # dataSet: training samples, one per row
    # labels: class label of each training row
    # k: number of nearest neighbors that vote

    # Distance computation
    dataSetSize = dataSet.shape[0]  # number of training rows
    # Repeat inX once per training row and subtract, so each entry holds the
    # per-feature difference between the input and one training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum across each row
    distances = sqDistances ** 0.5  # 1-D array: Euclidean distance from inX to every training sample
    sortedDistIndicies = distances.argsort()  # indices that put the distances in ascending order
    classCount = {}  # label -> number of votes
    for i in range(k):  # the k closest samples
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        # dict.get returns 0 when the label has not been seen yet
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the labels by vote count, most votes first, and return the winner
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


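# Convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1, 1024))
    with open(filename) as fr:  # the file is closed when the block exits
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect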


# Test harness for the handwritten digit recognizer
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # file names in the training directory
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]  # name of the i-th file
        fileStr = fileNameStr.split('.')[0]  # '0_0.txt' -> ['0_0', 'txt'], keep '0_0'
        classNumStr = int(fileStr.split('_')[0])  # digit label before the underscore
        hwLabels.append(classNumStr)  # labels are stored in the same order as the rows of trainingMat
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)  # one file per row
    testFileList = listdir('testDigits')  # file names in the test directory
    errorCount = 0.0
    mTest = len(testFileList)  # number of test samples
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])  # true label of this test file
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)  # one test sample as a vector
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)  # k = 3; rows and labels are aligned
        print("the classifier came back with : %d,the real answer is : %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount = errorCount + 1.0
    print("\n the total number of errors is :%d" % errorCount)
    print("\n the total error rate is: %f" % (errorCount / float(mTest)))

handwritingClassTest()
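
The distance-and-vote logic in classify0 can also be written with NumPy broadcasting instead of tile. The following is a sketch of an equivalent formulation, not a replacement for the function above: `knn_predict` and the tiny 2-D training set are hypothetical names and data chosen for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(x, train, labels, k):
    """Return the majority label among the k training rows closest to x.

    Hypothetical helper; mirrors classify0 but relies on broadcasting.
    """
    diffs = train - np.ravel(x)                     # x is broadcast across every training row
    distances = np.sqrt((diffs ** 2).sum(axis=1))   # Euclidean distance per row
    nearest = np.argsort(distances)[:k]             # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)     # label -> number of votes
    return votes.most_common(1)[0][0]


# Assumed toy data: two small clusters in 2-D
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_predict(np.array([0.1, 0.2]), train, labels, k=3))  # B
```

With k=3 the query point's three nearest neighbors carry the labels B, B, and A, so B wins the vote.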

Test results:

the classifier came back with : 0,the real answer is : 0
the classifier came back with : 0,the real answer is : 0
the classifier came back with : 0,the real answer is : 0
the classifier came back with : 0,the real answer is : 0
the classifier came back with : 0,the real answer is : 0
the classifier came back with : 0,the real answer is : 0
……
the classifier came back with : 1,the real answer is : 1
the classifier came back with : 7,the real answer is : 1
the classifier came back with : 1,the real answer is : 1
……
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 6,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
the classifier came back with : 3,the real answer is : 8
the classifier came back with : 8,the real answer is : 8
……

the total number of errors is :10
 the total error rate is: 0.010571
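
The reported error rate is the error count divided by the number of test files. The test-set size is not printed in the output; the value computed below is inferred from the two printed numbers and should be read as an assumption, not as a figure stated by the author.

```python
error_count = 10                # from the output above
error_rate_reported = 0.010571  # from the output above

# Assumption: infer the test-set size from the two printed values
m_test = round(error_count / error_rate_reported)
print(m_test)                         # 946
print("%f" % (error_count / m_test))  # 0.010571
```

Under that assumption, 10 errors out of 946 test files corresponds to an accuracy of roughly 98.9% with k = 3.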