TensorFlow入门教程(18)语音识别(中)

##作者：韦访#博客：https://blog.csdn.net/rookie_wei#微信：1007895847#添加微信的备注一下是CSDN的#欢迎大家一起学习#------韦访 201811266、提取音频数据的MFCC特征上一讲花了很大的篇幅来将这个MFCC特征，现在我们就来提取它。Python牛逼之处就是有非常多的工具支持各种操作，很完善，所以这里也不需要我们从头...

Fang Wei

9375人浏览 · 2018-11-29 10:54:54

__Fang Wei__ · 2018-11-29 10:54:54 发布

#
#作者：韦访
#博客：https://blog.csdn.net/rookie_wei
#微信：1007895847
#添加微信的备注一下是CSDN的
#欢迎大家一起学习
#

6、提取音频数据的MFCC特征

上一讲花了很大的篇幅来将这个MFCC特征，现在我们就来提取它。Python牛逼之处就是有非常多的工具支持各种操作，很完善，所以这里也不需要我们从头开始写，可以借助python_speech_features工具来实现。

首先来安装python_speech_features工具，执行以下命令行即可，

pip install python_speech_features

我们将语音数据转换为需要计算的13位或26位不同的倒谱特征的MFCC，将它作为模型的输入。经过转换，数据将会被存储在一个频率特征系数（行）和时间（列）的矩阵中。

因为声音不会孤立的产生，并且没有一对一映射到字符，所以，我们可以通过在当前时间索引之前和之后捕获声音的重叠窗口上训练网络，从而捕获共同作用的影响（即通过影响一个声音影响另一个发音）。

这里先插讲一下语音中的“分帧”和“加窗”的概念，

分帧：

如上图所示，傅里叶变换要求输入的信号是平稳的，但是语音信号在宏观上是不平稳的，在微观上却有短时平稳性（10-30ms内可以认为语音信号近似不变）。所以要把语音信号分为一些小段处理，每一个小段称为一帧。

加窗：

取出一帧信号以后，在进行傅里叶变换前，还有先进行“加窗”操作，“加窗”其实就是乘以一个“窗函数”，如下图所示，

加窗的目的是让一帧信号的幅度在两端渐变到0，这样就可以提供变换结果的分辨率。但是加窗也是有代价的，一帧信号的两端被削弱了，弥补的办法就是，邻近的帧直接要有重叠，而不是直接截取，如下图所示，

如上图所示，两帧之间有重叠部分，帧长为25ms，两帧起点位置的时间差叫帧移，一般取10ms或者帧长的一半。

对于RNN，我们使用之前的9个时间片段和后面的9个时间片段，加上当前时间片段，每个加载窗口总共包括19个时间片段。当梅尔倒谱系数为26时，每个时间片段总共就有494个MFCC特征数。下图是以倒谱系数为13为例的加载窗口实例图，

而当当前序列前或后不够9个序列时，比如第2个序列，这时就需要进行补0操作，将它凑够9个。最后，再进行标准化处理，减去均值，然后除以方差。下面来看代码，

# 将音频信息转成MFCC特征
# 参数说明---audio_filename：音频文件   numcep：梅尔倒谱系数个数
#       numcontext：对于每个时间段，要包含的上下文样本个数
def audiofile_to_mfcc_vector(audio_filename, numcep, numcontext):
    # 加载音频文件
    fs, audio = wav.read(audio_filename)
    # 获取MFCC系数
    orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)
    # 打印MFCC系数的形状，得到比如(955, 26)的形状
    # 955表示时间序列，26表示每个序列的MFCC的特征值为26个
    # 这个形状因文件而异，不同文件可能有不同长度的时间序列，但是，每个序列的特征值数量都是一样的
    # print(np.shape(orig_inputs))
 
    # 因为我们使用双向循环神经网络来训练,它的输出包含正、反向的结
    # 果,相当于每一个时间序列都扩大了一倍,所以
    # 为了保证总时序不变,使用orig_inputs =
    # orig_inputs[::2]对orig_inputs每隔一行进行一次
    # 取样。这样被忽略的那个序列可以用后文中反向
    # RNN生成的输出来代替,维持了总的序列长度。
    orig_inputs = orig_inputs[::2]  # (478, 26)
    # print(np.shape(orig_inputs))
    # 因为我们讲解和实际使用的numcontext=9，所以下面的备注我都以numcontext=9来讲解
    # 这里装的就是我们要返回的数据，因为同时要考虑前9个和后9个时间序列，
    # 所以每个时间序列组合了19*26=494个MFCC特征数
    train_inputs = np.array([], np.float32)
    train_inputs.resize((orig_inputs.shape[0], numcep + 2 * numcep * numcontext))
    # print(np.shape(train_inputs))#)(478, 494)
 
    # Prepare pre-fix post fix context
    empty_mfcc = np.array([])
    empty_mfcc.resize((numcep))
 
    # Prepare train_inputs with past and future contexts
    # time_slices保存的是时间切片，也就是有多少个时间序列
    time_slices = range(train_inputs.shape[0])
 
    # context_past_min和context_future_max用来计算哪些序列需要补零
    context_past_min = time_slices[0] + numcontext
    context_future_max = time_slices[-1] - numcontext
 
    # 开始遍历所有序列
    for time_slice in time_slices:
        # 对前9个时间序列的MFCC特征补0，不需要补零的，则直接获取前9个时间序列的特征
        need_empty_past = max(0, (context_past_min - time_slice))
        empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
        data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
        assert (len(empty_source_past) + len(data_source_past) == numcontext)
 
        # 对后9个时间序列的MFCC特征补0，不需要补零的，则直接获取后9个时间序列的特征
        need_empty_future = max(0, (time_slice - context_future_max))
        empty_source_future = list(empty_mfcc for empty_slots in range(need_empty_future))
        data_source_future = orig_inputs[time_slice + 1:time_slice + numcontext + 1]
        assert (len(empty_source_future) + len(data_source_future) == numcontext)
 
        # 前9个时间序列的特征
        if need_empty_past:
            past = np.concatenate((empty_source_past, data_source_past))
        else:
            past = data_source_past
 
        # 后9个时间序列的特征
        if need_empty_future:
            future = np.concatenate((data_source_future, empty_source_future))
        else:
            future = data_source_future
 
        # 将前9个时间序列和当前时间序列以及后9个时间序列组合
        past = np.reshape(past, numcontext * numcep)
        now = orig_inputs[time_slice]
        future = np.reshape(future, numcontext * numcep)
 
        train_inputs[time_slice] = np.concatenate((past, now, future))
        assert (len(train_inputs[time_slice]) == numcep + 2 * numcep * numcontext)
 
    # 将数据使用正太分布标准化，减去均值然后再除以方差
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)
 
    return train_inputs

7、文字样本转化成字典和数组

对于文字样本，我们需要将文字转换成向量，这样才方便计算机计算。上一讲中，我们调用get_wavs_and_tran_texts函数获取了所有的WAV文件和其对应的文字。现在，我们先来处理一下文字。先将所有文字提出来，然后，调用collections和Counter方法，统计一下每个字符出现的次数，然后，把它们放到字典里面去，代码如下，

def create_words_table(self):
        # 字表 
        all_words = []  
        for label in self.labels:  
            #print(label)    
            all_words += [word for word in label]
        
        #Counter，返回一个Counter对象集合，以元素为key，元素出现的个数为value
        counter = Counter(all_words)                
        #排序
        self.words = sorted(counter)
        
        self.words_size= len(self.words)

        with open(charactersfile, 'w', encoding='utf-8') as fd:
            for w in self.words:
                fd.write(w)
                fd.write('\n')

        self.words_map = dict(zip(self.words, range(self.words_size)))            
        print("words_map====>>>>", self.words_map)

为了方便，我们将获取音频相关操作封装成AudioProcessor类，该类的初始化函数如下，

def __init__(self, wav_path, tran_path, features, contexts):        
        self.features = features
        self.contexts = contexts
        self.wavs, self.labels = get_wavs_and_tran_texts(wav_path, tran_path)
        self.create_words_table()

features 和contexts 我们暂时先不管，先全部设置成0，然后，调用该类看看效果，代码如下，

if __name__ == "__main__":
    wav_path = 'dataset/data_thchs30/train'
    tran_path = 'dataset/data_thchs30/data'
 
    processor = AudioProcessor(wav_path, tran_path, 0, 0)

运行结果，

可以看到，words_map就是一个字典，我们可以根据字找到其对应的值，这个值我们可以理解为下标。我们再将words打印出来看看，如下，

可以看到，words是一个列表，并且与words_map是一一对应的。即，可以根据words_map找到字的下标，也可以根据words和字的下标找到对应的字。我们将words保存到txt文件，这样我们以后只要加载这个文件就可以了，而不用这么麻烦，应用的时候，也都是只有一个文件就可以了。导入代码如下，

def load_words_table(self):
        self.words = []
        
        with open(charactersfile, 'r', encoding='utf-8') as fd:
            while True:
                c = fd.readline().replace('\n', '')
                if c:
                    self.words += [c]
                else:
                    break

        # print("words====>>>>", self.words)
        self.words_size = len(self.words)
        self.words_map = dict(zip(self.words, range(self.words_size)))    
        # print("words_map====>>>>", self.words_map)

8、将文字转成向量

上面，我们将文字样本转成了字典，现在，我们要将文字转成向量，也就是将类似“绿是阳春烟景大块文章的底色四月的林峦更是绿得鲜活秀媚诗意盎然”的话用向量来表达。代码如下，

# 将字符转成向量，其实就是根据字找到字在words_map中所应对的下标
def get_labels_vector(words_map, txt_label=None):
    words_size = len(words_map)
 
    to_num = lambda word: words_map.get(word, words_size)

    labels_vector = list(map(to_num, txt_label))

    return labels_vector

然后，我们在load_words_table函数中调用它，代码如下，

def load_words_table(self):
        self.words = []
        
        with open(charactersfile, 'r', encoding='utf-8') as fd:
            while True:
                c = fd.readline().replace('\n', '')
                if c:
                    self.words += [c]
                else:
                    break

        # print("words====>>>>", self.words)
        self.words_size = len(self.words)
        self.words_map = dict(zip(self.words, range(self.words_size)))    
        # print("words_map====>>>>", self.words_map)
        # 将文字转成向量     
        vector = get_labels_vector(self.words_map, self.labels[0])
        print("labels[0]:", self.labels[0])
        print("vector:", vector)

运行结果，

dataset/data_thchs30/train\A11_0.wav 绿是阳春烟景大块文章的底色四月的林峦更是绿得鲜活秀媚诗意盎然

labels[0]: 绿是阳春烟景大块文章的底色四月的林峦更是绿得鲜活秀媚诗意盎然

vector: [1901, 0, 1159, 0, 2506, 1156, 0, 1523, 0, 1172, 0, 579, 524, 0, 1113, 1800, 0, 1664, 0, 812, 2039, 0, 501, 1187, 0, 1664, 0, 1230, 0, 747, 0, 1183, 1159, 0, 1901, 0, 866, 0, 2633, 1414, 0, 1758, 644, 0, 2222, 936, 0, 1672, 1530]

然后，我们看一下1901是不是对应“绿”字，

这样，我们就将文字转成了向量。

9、将音频数据转为MFCC，将译文转为向量

现在，整合上面的audiofile_to_mfcc_vector函数和get_labels_vector函数，将音频数据转为时间序列（列）和MFCC（行）的矩阵，且将对应的译文转成字向量，代码如下，

# 将音频数据转为时间序列（列）和MFCC（行）的矩阵，将对应的译文转成字向量
def get_mfcc_and_transcriptch(wavs, labels, features, contexts, words_map):
    audio = []
    audio_len = []
    transcript = []
    transcript_len = []
 
    for wav, label in zip(wavs, labels):
        # load audio and convert to features
        audio_data = audiofile_to_mfcc_vector(wav, features, contexts)
        audio_data = audio_data.astype('float32')
        # print(words_map)
        audio.append(audio_data)
        audio_len.append(len(audio_data))
 
        # load text transcription and convert to numerical array        
        target = get_labels_vector(words_map, label)  # txt_obj是labels
        # target = text_to_char_array(target)
        transcript.append(target)
        transcript_len.append(len(target))
 
    audio = np.asarray(audio)
    audio_len = np.asarray(audio_len)
    transcript = np.asarray(transcript)
    transcript_len = np.asarray(transcript_len)
    return audio, audio_len, transcript, transcript_len

10、批次音频数据对齐

上面是对单个音频文件的特征补0，在训练中，文件是一批一批的获取并进行训练的，这就要求每一批音频的时序要统一，所以，下面要做对齐处理。

#对齐处理
def pad_sequences(sequences, maxlen=None, dtype=np.float32, value=0.):
    #[478 512 503 406 481 509 422 465]
    lengths = np.asarray([len(s) for s in sequences], dtype=np.int64)
 
    seqlen = len(sequences)
 
    #maxlen，该批次中，最长的序列长度
    if maxlen is None:
        maxlen = np.max(lengths)
 
    # 在下面的主循环中，从第一个非空序列中获取样本形状,以获取每个时序的mfcc特征数    
    sample_shape = tuple()
    for s in sequences:        
        if len(s) > 0:
            # (568, 494)
            sample_shape = np.asarray(s).shape[1:]
            break
    
    # (seqlen, maxlen, mfcclen)
    x = (np.ones((seqlen, maxlen) + sample_shape) * value).astype(dtype)
    
    for i, s in enumerate(sequences):
        if len(s) == 0:
            continue  # 序列为空，跳过
        
        if s.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (s.shape[1:], i, sample_shape))
 
        x[i, :len(s)] = s

    return x, lengths

11、创建序列的稀疏表示

因为TensorFlow中，计算ctc loss的函数需要我们提供标签的系数矩阵表示，这个在我们设计网络模型的时候就知道了。所以，我们需要将已经向量化的标签转成稀疏矩阵的形式。

稀疏矩阵是什么呢？稀疏矩阵是相对稠密矩阵而言的，稠密矩阵就是我们常见的矩阵，如果稠密矩阵大部分数都是0，那么就没有必要浪费空间来存这些为0的数据，我们只要将那些不为0的索引、值和形状记录下来，就可以大大节省内存空间，这个就是稀疏矩阵。而稀疏矩阵的稀疏表示，则是用3个数组来表示一个矩阵，其中，indices表示非零元素的下标，values表示indices中下标对应的非零值，shape表示形状。

# 创建序列的稀疏表示，这个才是真的稀疏矩阵
def sparse_tuple_(sequences, dtype=np.int32):
    indices = []
    values = []
 
    for i, seq in enumerate(sequences):
        for j, value in enumerate(seq):
            if value != 0:  
                indices.extend([[i, j]])
                values.extend([value])
 
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)

    return indices, values, shape

我们来测试一下上面的代码看看，测试代码如下，

if __name__ == "__main__":
    wav_path = 'dataset/data_thchs30/train'
    tran_path = 'dataset/data_thchs30/data'

    words, words_map = load_words_table_()    
    _, labels = get_wavs_and_tran_texts(wav_path, tran_path)
    ch_lable = get_labels_vector(words_map, labels[0])
    stuple = sparse_tuple_([ch_lable])
    print("indices:", stuple[0])
    print("values:", stuple[1])
    print("shape:", stuple[2])

运行结果如下，

12、将稀疏表示转成文字

上面我们将文字转成了稀疏表示，现在我们需要将稀疏表示还原为文字，

13、将字向量转成文字

上面有将文字转成字向量的函数，那么，也应该有将字向量转成文字的函数，代码如下，代码如下，

# 将稀疏矩阵的字向量转成文字
# tuple是sparse_tuple函数的返回值
def sparse_tuple_to_text_(tuple, words):
    # 索引
    indices = tuple[0]
    # 字向量
    values = tuple[1]
    
    dense = np.zeros(tuple[2]).astype(np.int32)

    for i in range(len(indices)):
        dense[indices[i][0]][indices[i][1]] = values[i]
    
    results = [''] * tuple[2][0]
    for i in range(dense.shape[0]):
        for j in range(dense.shape[1]):            
            c = dense[i][j]
            c = ' ' if c == SPACE_INDEX else words[c]
            results[i] = results[i] + c
 
    return results

我们来测试一下。测试代码如下，

if __name__ == "__main__":
    wav_path = 'dataset/data_thchs30/train'
    tran_path = 'dataset/data_thchs30/data'

    words, words_map = load_words_table_()    
    _, labels = get_wavs_and_tran_texts(wav_path, tran_path)
    ch_lable = get_labels_vector(words_map, labels[0])
    stuple = sparse_tuple_([ch_lable])
    print("indices:", stuple[0])
    print("values:", stuple[1])
    print("shape:", stuple[2])
    texts = sparse_tuple_to_text_(stuple, words)
    print("texts:", texts)

运行结果，

14、“假”的稀疏矩阵表示

上面我们测试的是单个的句子的稀疏表示和将稀疏表示转成文字，但是我们在训练中一般都是批量的，所以，我们还是使用上面的测试代码，但是我们使用批量的数据，代码如下，

if __name__ == "__main__":
    wav_path = 'dataset/data_thchs30/train'
    tran_path = 'dataset/data_thchs30/data'

    words, words_map = load_words_table_()    
    _, labels = get_wavs_and_tran_texts(wav_path, tran_path)
    ch_lable = get_labels_vector(words_map, labels[0])
    ch_lable1 = get_labels_vector(words_map, labels[1]) 
    stuple = sparse_tuple([ch_lable, ch_lable1])
    print("indices:", stuple[0])
    print("values:", stuple[1])
    print("shape:", stuple[2])
    texts = sparse_tuple_to_text_(stuple, words)
    print("texts:", texts)

运行结果，

可以看到，在稀疏表示还原成文字的时候，“盎然”后面多了几个空格，这是因为批量处理时，矩阵的shape是统一大小的，所以对于比较短的句子，就会在句子后面“补零”了。

为了避免这个问题，我们在将文字转成稀疏表示的时候，不管元素是否为0，都包含进来，这样我们在还原的时候就可以清楚的知道矩阵的真实长度，就不会出现上面这种现象，所以，我们修改一下上面的sparse_tuple_ 和 sparse_tuple_to_text_函数，为了不混淆，将函数名字改成sparse_tuple 和 sparse_tuple_to_text（就是去掉了函数名末尾的下划线），代码如下，

# 创建序列的稀疏表示，为了方便，我们这里只是做假的稀疏矩阵，我们只是需要稀疏矩阵的形式，因为ctc计算需要这种形式
def sparse_tuple(sequences, dtype=np.int32):
    indices = []
    values = []
 
    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)
 
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)
 
    # return tf.SparseTensor(indices=indices, values=values, shape=shape)
    return indices, values, shape

# 将稀疏矩阵的字向量转成文字，我们这里的稀疏矩阵也是假的
# tuple是sparse_tuple函数的返回值
def sparse_tuple_to_text(tuple, words):
    # 索引
    indices = tuple[0]
    # 字向量
    values = tuple[1]
    results = [''] * tuple[2][0]
    for i in range(len(indices)):
        index = indices[i][0]
        c = values[i]
        c = ' ' if c == SPACE_INDEX else words[c]
        results[index] = results[index] + c
 
    return results

然后，我们再来测试一下，测试代码如下，

if __name__ == "__main__":
    wav_path = 'dataset/data_thchs30/train'
    tran_path = 'dataset/data_thchs30/data'

    words, words_map = load_words_table_()    
    _, labels = get_wavs_and_tran_texts(wav_path, tran_path)

    ch_lable = get_labels_vector(words_map, labels[0])
    ch_lable1 = get_labels_vector(words_map, labels[1])
    stuple = sparse_tuple([ch_lable, ch_lable1])
    
    print("indices:", stuple[0])
    print("values:", stuple[1])
    print("shape:", stuple[2])
    texts = sparse_tuple_to_text(stuple, words)
    print("texts:", texts)

运行结果，

15、next_batch函数

接下来，我们来实现AudioProcessor类的next_batch函数，代码如下，

def next_batch(self, start_index=0, batch_size=1):
        filesize = len(self.labels)
        # 计算要获取的序列的开始和结束下标
        end_index = min(filesize, start_index + batch_size)
        index_list = range(start_index, end_index)
        # 获取要训练的音频文件路径和对于的译文
        labels = [self.labels[i] for i in index_list]
        wavs = [self.wavs[i] for i in index_list]
        # 将音频文件转成要训练的数据
        (source, _, target, _) = get_mfcc_and_transcriptch(wavs, labels, self.features,
                                                                self.contexts, self.words_map)
    
        start_index += batch_size
        # Verify that the start_index is not largVerify that the start_index is not ler than total available sample size
        if start_index >= filesize:
            start_index = -1
    
        # Pad input to max_time_step of this batch
        # 对齐处理，如果是多个文件，将长度统一，支持按最大截断或补0
        source, source_lengths = pad_sequences(source)
        # 返回序列的稀疏表示
        sparse_labels = sparse_tuple(target)
    
        return start_index, source, source_lengths, sparse_labels

为了验证我们上面所有写的的代码是否达到预期，同样写一个测试代码来测试，代码如下，

if __name__ == "__main__":
    wav_path = 'dataset/data_thchs30/train'
    tran_path = 'dataset/data_thchs30/data'
  

    # 梅尔倒谱系数的个数
    features = 26
    # 对于每个时间序列，要包含上下文样本的个数
    contexts = 9
    # batch大小
    batch_size = 8

    processor = AudioProcessor(wav_path, tran_path, features, contexts)
    next_index = 0
    for batch in range(5):  # 一次batch_size，取多少次
        next_index, source, source_lengths, sparse_labels = processor.next_batch(next_index, batch_size)        
        print("source_lengths:", source_lengths)
        print("sparse_labels:", sparse_labels)

运行结果，