Bell Labs
Continuous speech recognizer
Automatic music classification system

Steps of speech recognition

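1. Feed the sound wave into the computer (decoding raw audio).
2. Convert the sound wave into numbers for storage. A sound wave is one-dimensional, so it is enough to record the height of the wave at equally spaced points in time.
3. Sampling: read N samples per second.
Nyquist theorem: sampling rate >= 2 * f_max, where f_max is the highest frequency in the sound.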
Split the audio into short segments, for example 20 ms each.
4. Turn the numbers into a simple line plot.

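5. Compute the energy in each frequency band and build a voiceprint for the audio snippet: plot the signal as a spectrogram, with frequency on the vertical axis and time on the horizontal axis.
6. Slice the audio snippet and identify the letter behind each sound, for example vowels.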
A neural network then predicts, for each letter, the probability of the letter that follows it.
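Steps 2 to 5 can be sketched with NumPy alone. This is a minimal illustration rather than the pipeline of any particular recognizer: the 1000 Hz test tone, the 16 kHz sampling rate and the Hann window are assumptions chosen for the example; only the 20 ms frame length comes from the text above.

```python
import numpy as np

SAMPLE_RATE = 16_000      # samples per second (assumed for this sketch)
TONE_HZ = 1_000           # frequency of the synthetic test tone (assumed)
FRAME_SECONDS = 0.020     # 20 ms frames, as in the text

# Nyquist: the sampling rate must be at least twice the highest frequency.
assert SAMPLE_RATE >= 2 * TONE_HZ

# Steps 2-3: record the wave height at equally spaced points in time.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE          # one second of sample times
signal = np.sin(2 * np.pi * TONE_HZ * t)

# Split the samples into consecutive 20 ms frames.
frame_len = round(FRAME_SECONDS * SAMPLE_RATE)    # samples per frame
n_frames = signal.size // frame_len
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

# Step 5: energy per frequency band for every frame (a basic spectrogram).
window = np.hanning(frame_len)
power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
spectrogram = power.T                             # rows: frequency, columns: time
freqs = np.fft.rfftfreq(frame_len, d=1 / SAMPLE_RATE)

# The strongest band in each frame should sit at the tone frequency.
dominant_hz = freqs[power.argmax(axis=1)]
print(spectrogram.shape, np.median(dominant_hz))
```

Passing `spectrogram` to a plotting call such as matplotlib's `imshow` would give the frequency-versus-time picture described in step 5.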

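For classification, features such as pitch and MFCCs are extracted from the audio and used to train a classifier. When an unknown audio clip is fed in, the trained model predicts its class.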
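The extract-features, train, predict loop can be illustrated with a toy example. Everything here is an assumption made for the sketch: the synthetic tones, the two hand-made features (dominant frequency and spectral centroid, standing in for pitch and MFCCs) and the choice of a k-nearest-neighbours classifier.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

SAMPLE_RATE = 8_000  # assumed sampling rate for the synthetic clips


def make_tone(freq_hz, seed, seconds=0.5, noise=0.05):
    """Synthesize a noisy sine wave to stand in for a real recording."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq_hz * t) + noise * rng.standard_normal(t.size)


def extract_features(signal):
    """Return [dominant frequency, spectral centroid] of a clip."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1 / SAMPLE_RATE)
    dominant = freqs[spectrum.argmax()]
    centroid = (freqs * spectrum).sum() / spectrum.sum()
    return [dominant, centroid]


# Training data: five low-pitched and five high-pitched clips.
low = [200, 220, 250, 280, 300]
high = [1000, 1050, 1100, 1150, 1200]
X_train = [extract_features(make_tone(f, seed=i)) for i, f in enumerate(low + high)]
y_train = ["low"] * len(low) + ["high"] * len(high)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Unknown clips: the model predicts their class from the same features.
unknown = [extract_features(make_tone(240, seed=100)),
           extract_features(make_tone(1120, seed=101))]
predicted = model.predict(unknown)
print(predicted)
```

With real audio the feature step would typically be replaced by a dedicated audio library, but the train and predict calls keep the same shape.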
Music formats:

WMA (Windows Media Audio)
MP3
WAV

CD audio uses a sampling rate of 44.1 kHz.
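As a worked example of what 44.1 kHz implies, the arithmetic below applies the Nyquist relation from earlier. The 16-bit depth and two channels are the standard CD audio parameters and are not stated in the text above, so treat them as assumptions of this sketch.

```python
SAMPLE_RATE_HZ = 44_100   # CD sampling rate, from the text
BYTES_PER_SAMPLE = 2      # 16-bit samples (assumed: standard CD audio)
CHANNELS = 2              # stereo (assumed: standard CD audio)

# Highest frequency that can be represented at this sampling rate.
nyquist_hz = SAMPLE_RATE_HZ / 2

# Raw data rate of uncompressed audio.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
megabytes_per_minute = bytes_per_second * 60 / 1_000_000

print(nyquist_hz, bytes_per_second, round(megabytes_per_minute, 2))
```

The 22.05 kHz limit sits just above the commonly quoted upper bound of human hearing (about 20 kHz).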

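I. Music classification example (the code below trains a decision tree and a logistic regression rather than an SVM)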

1. Dataset: EchoNest.


2. Code:

import pandas as pd

# Load the track metadata
tracks = pd.read_csv('D:/My life/music/echonest/fma-rock-vs-hiphop.csv')
print(tracks.shape)
# tracks[0:5]
tracks

(17734, 21)

       track_id  bit_rate  comments        composer         date_created        date_recorded  duration  favorites  genre_top          genres  ...
0           135    256000         1             NaN  2008-11-26 01:43:26  2008-11-26 00:00:00       837          0       Rock        [45, 58]  ...
1           136    256000         1             NaN  2008-11-26 01:43:35  2008-11-26 00:00:00       509          0       Rock        [45, 58]  ...
2           151    192000         0             NaN  2008-11-26 01:44:55                  NaN       192          0       Rock            [25]  ...
3           152    192000         0             NaN  2008-11-26 01:44:58                  NaN       193          0       Rock            [25]  ...
4           153    256000         0  Arc and Sender  2008-11-26 01:45:00  2008-11-26 00:00:00       405          5       Rock            [26]  ...
...
17729    155063    320000         0             NaN  2017-03-24 19:40:43                  NaN       283          3    Hip-Hop       [21, 811]  ...
17730    155064    320000         0             NaN  2017-03-24 19:40:44                  NaN       250          2    Hip-Hop       [21, 811]  ...
17731    155065    320000         0             NaN  2017-03-24 19:40:45                  NaN       219          3    Hip-Hop       [21, 811]  ...
17732    155066    320000         0             NaN  2017-03-24 19:40:47                  NaN       252          6    Hip-Hop       [21, 811]  ...
17733    155247    320000         0         Fleslit  2017-03-29 01:40:28                  NaN       211          3    Hip-Hop  [21, 539, 811]  ...

       information  interest  language_code  license                                            listens  lyricist  number  publisher  tags                                               title
0              NaN      2484             en  Attribution-NonCommercial-ShareAlike 3.0 Inter...     1832       NaN       0        NaN  []                                                 Father's Day
1              NaN      1948             en  Attribution-NonCommercial-ShareAlike 3.0 Inter...     1498       NaN       0        NaN  []                                                 Peel Back The Mountain Sky
2              NaN       701             en  Attribution-NonCommercial-ShareAlike 3.0 Inter...      148       NaN       4        NaN  []                                                 Untitled 04
3              NaN       637             en  Attribution-NonCommercial-ShareAlike 3.0 Inter...       98       NaN      11        NaN  []                                                 Untitled 11
4              NaN       354             en  Attribution-NonCommercial-NoDerivatives (aka M...      424       NaN       2        NaN  []                                                 Hundred-Year Flood
...
17729          NaN      1283            NaN  Attribution                                           1050       NaN       4        NaN  ['old school beats', '2017 free instrumentals'...  Been On
17730          NaN      1077            NaN  Attribution                                            858       NaN       2        NaN  ['old school beats', '2017 free instrumentals'...  Send Me
17731          NaN      1340            NaN  Attribution                                           1142       NaN       1        NaN  ['old school beats', '2017 free instrumentals'...  The Question
17732          NaN      2065            NaN  Attribution                                           1474       NaN       3        NaN  ['old school beats', '2017 free instrumentals'...  Roy
17733          NaN      1379            NaN  Attribution                                           1025       NaN       0    Fleslit  ['instrumental trap beat', 'love', 'instrument...  Love In The Sky

17734 rows × 21 columns

# Load the EchoNest metrics for each track
echonest_metrics = pd.read_json('D:/My life/music/echonest/echonest-metrics.json', precise_float = True)
print(echonest_metrics.shape)
echonest_metrics

(13129, 9)

Output:

       track_id  acousticness  danceability    energy  instrumentalness  liveness  speechiness    tempo   valence
0             2      0.416675      0.675894  0.634476          0.010628  0.177647     0.159310  165.922  0.576661
1             3      0.374408      0.528643  0.817461          0.001851  0.105880     0.461818  126.957  0.269240
2             5      0.043567      0.745566  0.701470          0.000697  0.373143     0.124595  100.260  0.621661
3            10      0.951670      0.658179  0.924525          0.965427  0.115474     0.032985  111.562  0.963590
4           134      0.452217      0.513238  0.560410          0.019443  0.096567     0.525519  114.290  0.894072
...
13124    124857      0.007592      0.790364  0.719288          0.853114  0.720715     0.082550  141.332  0.890461
13125    124862      0.041498      0.843077  0.536496          0.865151  0.547949     0.074001  101.975  0.476845
13126    124863      0.000124      0.609686  0.895136          0.846624  0.632903     0.051517  129.996  0.496667
13127    124864      0.327576      0.574426  0.548327          0.452867  0.075928     0.033388  142.009  0.569274
13128    124911      0.993606      0.499339  0.050622          0.945677  0.095965     0.065189  119.965  0.204652

13129 rows × 9 columns

# Merge the metrics with the genre labels (pd.merge defaults to an inner join, here on track_id)
echo_tracks = pd.merge(echonest_metrics, tracks[['track_id', 'genre_top']], on = 'track_id')
print(echo_tracks.shape)
echo_tracks

(4802, 10)

Output:

       track_id  acousticness  danceability    energy  instrumentalness  liveness  speechiness    tempo   valence  genre_top
0             2      0.416675      0.675894  0.634476      1.062807e-02  0.177647     0.159310  165.922  0.576661    Hip-Hop
1             3      0.374408      0.528643  0.817461      1.851103e-03  0.105880     0.461818  126.957  0.269240    Hip-Hop
2             5      0.043567      0.745566  0.701470      6.967990e-04  0.373143     0.124595  100.260  0.621661    Hip-Hop
3           134      0.452217      0.513238  0.560410      1.944269e-02  0.096567     0.525519  114.290  0.894072    Hip-Hop
4           153      0.988306      0.255661  0.979774      9.730057e-01  0.121342     0.051740   90.241  0.034018       Rock
...
4797     124718      0.412194      0.686825  0.849309      6.000000e-10  0.867543     0.367315   96.104  0.692414    Hip-Hop
4798     124719      0.054973      0.617535  0.728567      7.215700e-06  0.131438     0.243130   96.262  0.399720    Hip-Hop
4799     124720      0.010478      0.652483  0.657498      7.098000e-07  0.701523     0.229174   94.885  0.432240    Hip-Hop
4800     124721      0.067906      0.432421  0.764508      1.625500e-06  0.104412     0.310553  171.329  0.580087    Hip-Hop
4801     124722      0.153518      0.638660  0.762567      5.000000e-10  0.264847     0.303372   77.842  0.656612    Hip-Hop

4802 rows × 10 columns
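The merged table has fewer rows (4802) than either input (13129 and 17734) because `pd.merge` performs an inner join by default: only `track_id` values present in both tables survive. The toy frames below are made up for illustration; `how="outer"` with `indicator=True` is one way to see which rows were unmatched.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
metrics = pd.DataFrame({"track_id": [2, 3, 5, 10],
                        "tempo": [165.9, 127.0, 100.3, 111.6]})
tracks = pd.DataFrame({"track_id": [2, 3, 134],
                       "genre_top": ["Hip-Hop", "Hip-Hop", "Rock"]})

# Default inner join: keeps only track_ids found in both frames.
inner = pd.merge(metrics, tracks, on="track_id")

# Outer join with an indicator column to inspect what was dropped.
outer = pd.merge(metrics, tracks, on="track_id", how="outer", indicator=True)

print(inner.shape)
print(outer["_merge"].value_counts())
```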

# Inspect the merged DataFrame
echo_tracks.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4802 entries, 0 to 4801
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          4802 non-null   int64  
 1   acousticness      4802 non-null   float64
 2   danceability      4802 non-null   float64
 3   energy            4802 non-null   float64
 4   instrumentalness  4802 non-null   float64
 5   liveness          4802 non-null   float64
 6   speechiness       4802 non-null   float64
 7   tempo             4802 non-null   float64
 8   valence           4802 non-null   float64
 9   genre_top         4802 non-null   object 
dtypes: float64(8), int64(1), object(1)
memory usage: 412.7+ KB

echo_tracks.describe()

Output:

            track_id  acousticness  danceability       energy  instrumentalness     liveness  speechiness        tempo      valence
count    4802.000000  4.802000e+03   4802.000000  4802.000000       4802.000000  4802.000000  4802.000000  4802.000000  4802.000000
mean    30164.871720  4.870600e-01      0.436556     0.625126          0.604096     0.187997     0.104877   126.687944     0.453413
std     28592.013796  3.681396e-01      0.183502     0.244051          0.376487     0.150562     0.145934    34.002473     0.266632
min         2.000000  9.491000e-07      0.051307     0.000279          0.000000     0.025297     0.023234    29.093000     0.014392
25%      7494.250000  8.351236e-02      0.296047     0.450757          0.164972     0.104052     0.036897    98.000750     0.224617
50%     20723.500000  5.156888e-01      0.419447     0.648374          0.808752     0.123080     0.049594   124.625500     0.446240
75%     44240.750000  8.555765e-01      0.565339     0.837016          0.915472     0.215151     0.088290   151.450000     0.666914
max    124722.000000  9.957965e-01      0.961871     0.999768          0.993134     0.971392     0.966177   250.059000     0.983649

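Check the pairwise relationships between the continuous variables: avoiding strongly correlated (redundant) features keeps the model simple and improves interpretability.

# Correlation matrix of the numeric columns (genre_top is a string column, so numeric dtypes are selected explicitly)
corr_metrics = echo_tracks.select_dtypes('number').corr()
corr_metrics.style.background_gradient()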

Output:
[Figure: correlation matrix rendered as a colour-gradient table]
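Besides reading the coloured matrix by eye, strongly correlated pairs can be listed programmatically. The sketch below uses made-up random data and an arbitrary 0.8 threshold, both assumptions for illustration; column `b` is constructed to track column `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a multiple of "a"
    "c": rng.normal(size=200),                     # independent noise
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (assumed) threshold.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```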

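① Standardize the data. PCA rotates the data along the axis of highest variance, which lets us determine the relative contribution of each feature to the variance between classes. Because features on larger scales would otherwise dominate, each feature is first standardized to mean 0 and standard deviation 1.

# Define the features
features = echo_tracks.drop(['genre_top', 'track_id'], axis = 1)
# Define the labels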
labels = echo_tracks['genre_top']

labels

Output:

0       Hip-Hop
1       Hip-Hop
2       Hip-Hop
3       Hip-Hop
4          Rock
         ...   
4797    Hip-Hop
4798    Hip-Hop
4799    Hip-Hop
4800    Hip-Hop
4801    Hip-Hop
Name: genre_top, Length: 4802, dtype: object

features[0:5]

Output:

   acousticness  danceability    energy  instrumentalness  liveness  speechiness    tempo   valence
0      0.416675      0.675894  0.634476          0.010628  0.177647     0.159310  165.922  0.576661
1      0.374408      0.528643  0.817461          0.001851  0.105880     0.461818  126.957  0.269240
2      0.043567      0.745566  0.701470          0.000697  0.373143     0.124595  100.260  0.621661
3      0.452217      0.513238  0.560410          0.019443  0.096567     0.525519  114.290  0.894072
4      0.988306      0.255661  0.979774          0.973006  0.121342     0.051740   90.241  0.034018

# Import the scaler
from sklearn.preprocessing import StandardScaler

# Scale the features and store the result in a new variable
scaler = StandardScaler()
scaled_train_features = scaler.fit_transform(features)
scaler

StandardScaler()

scaled_train_features[0:5]

Output:

array([[-0.19121034,  1.30442004,  0.03831594, -1.57649422, -0.06875487,
         0.37303429,  1.15397908,  0.46228696],
       [-0.30603598,  0.50188641,  0.78817624, -1.59980943, -0.54546309,
         2.44615517,  0.00791367, -0.69081137],
       [-1.20481276,  1.68413943,  0.31285194, -1.60287574,  1.22982787,
         0.13513049, -0.77731688,  0.63107745],
       [-0.09465518,  0.41792741, -0.26520319, -1.55307896, -0.60732615,
         2.88270682, -0.36465686,  1.6528586 ],
       [ 1.36170559, -0.98589622,  1.45332318,  0.97997488, -0.44275673,
        -0.36415677, -1.07200261, -1.57310227]])
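What `StandardScaler` computes can be reproduced by hand: subtract each column's mean and divide by its standard deviation. The small matrix below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three samples, two features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# z = (x - mean) / std, computed per column with the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(manual.mean(axis=0), manual.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the variance that PCA looks at.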

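② Principal component analysis (PCA): reduce the number of features, i.e. the dimensionality of the data.

# Render plots inline (Jupyter magic)
%matplotlib inline

# Import the plotting module and PCA
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on all features and get the explained variance ratios
pca = PCA()
pca.fit(scaled_train_features)
exp_variance = pca.explained_variance_ratio_

# Bar plot of the explained variance
fig, ax = plt.subplots()  # Note: plt.subplots (plural), not plt.subplot; the singular form raises an error when unpacked like this.
ax.bar(x = range(pca.n_components_), height = exp_variance)
ax.set_xlabel('Principal Component')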

Text(0.5, 0, 'Principal Component')
[Figure: bar plot of the explained variance ratio of each principal component]

import numpy as np

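# Cumulative explained variance
cum_exp_variance = np.cumsum(exp_variance)

# Plot the cumulative explained variance with a dashed line at 0.9
fig, ax = plt.subplots()
ax.plot(cum_exp_variance)
ax.axhline(y = 0.9, linestyle = '--')
n_components = 7

# Run PCA with the chosen number of components and project the data onto them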
pca = PCA(n_components, random_state = 10)
pca.fit(scaled_train_features)
pca_projection = pca.transform(scaled_train_features)
print(pca_projection)
print(scaled_train_features)

Output:

[[ 1.59666656  1.0500117  -0.01778555 ... -0.36832686 -0.71505324
  -0.28731253]
 [ 1.58153526  1.07661327  1.04346038 ... -1.81917099  1.3884574
   0.12558375]
 [ 2.01545627  1.4085176   0.24506524 ...  0.62769959 -0.45716338
  -0.05285551]
 ...
 [ 1.66908628  1.84010121  2.38294303 ...  1.23664547 -0.63277253
   0.60721569]
 [ 1.17001951  2.03158181  0.08689922 ... -1.45765649 -0.03590123
  -0.02431674]
 [ 2.36368976  1.15900708  0.4473735  ... -0.03592518  0.82678557
  -0.14947633]]
[[-0.19121034  1.30442004  0.03831594 ...  0.37303429  1.15397908
   0.46228696]
 [-0.30603598  0.50188641  0.78817624 ...  2.44615517  0.00791367
  -0.69081137]
 [-1.20481276  1.68413943  0.31285194 ...  0.13513049 -0.77731688
   0.63107745]
 ...
 [-1.29470431  1.17682795  0.13265633 ...  0.85182206 -0.93541008
  -0.07941825]
 [-1.13869115 -0.02253433  0.57117905 ...  1.40951543  1.31301348
   0.47513794]
 [-0.90611434  1.10148973  0.56322452 ...  1.36030881 -1.43669053
   0.76217464]]

[Figure: cumulative explained variance with a dashed line at 0.9]
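Instead of reading the number of components off the plot, it can be derived from the cumulative explained variance. The sketch below uses synthetic data (an assumption, not the EchoNest features) and the same 0.9 threshold as the dashed line; scikit-learn's `PCA` also accepts a fraction as `n_components` when `svd_solver="full"`, which should select the same count.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 8 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(10)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(300, 8))

# Smallest number of components whose cumulative variance reaches 90%.
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_needed = int(np.argmax(cum >= 0.9)) + 1

# Let PCA pick the count itself by passing the variance fraction.
pca_auto = PCA(n_components=0.9, svd_solver="full").fit(X)

print(n_needed, pca_auto.n_components_)
```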

③ Train a decision tree to classify genre. The lower-dimensional PCA projection of the data is used to classify songs into genres. First split the dataset into a training set and a test set.

# Import train_test_split
# Import the decision tree classifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the data
train_features, test_features, train_labels, test_labels = train_test_split(pca_projection, labels, random_state = 10)

# Train the decision tree
tree = DecisionTreeClassifier(random_state = 10)
tree.fit(train_features, train_labels)

# Predict the labels of the test data
pred_labels_tree = tree.predict(test_features)
pred_labels_tree[0:100]

Output:

array(['Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Hip-Hop', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Hip-Hop',
       'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Hip-Hop', 'Rock',
       'Hip-Hop', 'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Hip-Hop',
       'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Rock',
       'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Hip-Hop',
       'Hip-Hop'], dtype=object)

# Compare the decision tree with logistic regression
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Train the logistic regression and predict the test-set labels
logreg = LogisticRegression(random_state = 10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)

# Build a classification report for each model
from sklearn.metrics import classification_report
class_rep_tree = classification_report(test_labels, pred_labels_tree)
class_rep_log = classification_report(test_labels, pred_labels_logit)

print("Decision Tree:\n", class_rep_tree)
print("Logistic Regression:\n", class_rep_log)

Output:

Decision Tree:
               precision    recall  f1-score   support

     Hip-Hop       0.68      0.66      0.67       235
        Rock       0.92      0.93      0.92       966

    accuracy                           0.87      1201
   macro avg       0.80      0.79      0.80      1201
weighted avg       0.87      0.87      0.87      1201

Logistic Regression:
               precision    recall  f1-score   support

     Hip-Hop       0.78      0.57      0.66       235
        Rock       0.90      0.96      0.93       966

    accuracy                           0.88      1201
   macro avg       0.84      0.76      0.79      1201
weighted avg       0.88      0.88      0.88      1201
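
The test set holds 235 Hip-Hop tracks against 966 Rock tracks, so overall accuracy is dominated by Rock and the per-class precision and recall are the more telling numbers. The sketch below shows how those two columns follow from a confusion matrix; the ten labels are made up for illustration and are not the model's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions.
y_true = ["Rock"] * 6 + ["Hip-Hop"] * 4
y_pred = ["Rock"] * 5 + ["Hip-Hop"] + ["Hip-Hop"] * 2 + ["Rock"] * 2

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["Hip-Hop", "Rock"])
tp, fn = cm[0]   # true Hip-Hop predicted as Hip-Hop / as Rock
fp, tn = cm[1]   # true Rock predicted as Hip-Hop / as Rock

precision = tp / (tp + fp)   # of the tracks predicted Hip-Hop, how many really are
recall = tp / (tp + fn)      # of the real Hip-Hop tracks, how many were found

print(cm)
print(precision, recall)
```
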
pred_labels_logit.shape

(1201,)

len(pred_labels_logit)

1201

pred_labels_logit[0:100]

Output:

array(['Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Hip-Hop',
       'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Hip-Hop', 'Rock',
       'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Hip-Hop', 'Hip-Hop', 'Rock',
       'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
       'Rock', 'Rock', 'Hip-Hop', 'Rock', 'Rock', 'Rock', 'Rock',
       'Hip-Hop', 'Hip-Hop'], dtype=object)

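④ Evaluate the models with cross-validation (CV)

from sklearn.model_selection import KFold, cross_val_score

# Set up 10-fold cross-validation. random_state only has an effect when shuffle=True,
# and recent scikit-learn versions raise an error if it is passed without shuffling, so it is omitted here.
kf = KFold(n_splits = 10)

tree = DecisionTreeClassifier(random_state = 10)
logreg = LogisticRegression(random_state = 10, solver = 'lbfgs')

# Score each model with K-fold CV
tree_score = cross_val_score(estimator = tree, X = pca_projection, y = labels, cv = kf)
logit_score = cross_val_score(estimator = logreg, X = pca_projection, y = labels, cv = kf)

# Print the mean of each score array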
print("Decision Tree:", np.mean(tree_score).round(4), "\nLogistic Regression:", np.mean(logit_score).round(4))

Decision Tree: 0.86
Logistic Regression: 0.8794
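
In the walkthrough above the scaler and PCA are fitted on the whole dataset before any splitting, so the held-out data in each fold has already influenced the preprocessing. One common way to avoid this is to wrap the steps in a scikit-learn `Pipeline`, which refits them inside every fold. This is a sketch on synthetic data from `make_classification` (an assumption, not the EchoNest features), with 6 components chosen arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 audio features and 2 genres.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=10)

# Scaling, PCA and the classifier are fitted together on each training fold only.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=6),
                     LogisticRegression(max_iter=1000))

kf = KFold(n_splits=10, shuffle=True, random_state=10)
scores = cross_val_score(pipe, X, y, cv=kf)

print(scores.mean().round(4))
```
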

II. TTS (text to speech): convert text to speech (works for both Chinese and English)

import gtts
import pyttsx3

gtts.__version__

'2.2.3'

engine = pyttsx3.init()
engine.say("Sweeter")
engine.say("音乐是时间的艺术")
engine.runAndWait()
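
The cell above imports `gtts` but only speaks through `pyttsx3`. For completeness, here is a minimal sketch of the gTTS route, which sends the text to Google's online service and saves an MP3 instead of speaking directly. The helper name `save_speech`, the output file names and the language codes `"en"` and `"zh-CN"` are assumptions for this example; it needs network access, so the calls are left commented out.

```python
def save_speech(text, lang, out_path):
    """Synthesize `text` with gTTS and save it as an MP3 file.

    `save_speech` is a hypothetical helper written for this example.
    """
    # Imported lazily: requires `pip install gTTS` and a network connection.
    from gtts import gTTS

    gTTS(text=text, lang=lang).save(out_path)
    return out_path


# Example calls (uncomment to run; file names are placeholders):
# save_speech("Sweeter", "en", "sweeter.mp3")
# save_speech("音乐是时间的艺术", "zh-CN", "art_of_time.mp3")
```
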

III. Playing audio

Install: pip install playsound

from playsound import playsound

playsound('D:/My life/music/some music/sodagreen/take_me_away.wav')
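
Before playing a WAV file it can be useful to check its sampling rate and duration, which ties back to the sampling discussion at the top. The sketch below uses only the standard-library `wave` module; to stay self-contained it first writes a one-second sine tone to a temporary file (the helper names and the 440 Hz tone are assumptions for the example).

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, freq_hz=440.0, seconds=1.0, sample_rate=44_100):
    """Write a mono 16-bit sine tone so the example has a file to inspect."""
    n_samples = int(seconds * sample_rate)
    samples = (int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
               for i in range(n_samples))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def wav_info(path):
    """Return channel count, sampling rate and duration of a WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {"channels": wf.getnchannels(),
                "sample_rate": rate,
                "seconds": wf.getnframes() / rate}


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tone.wav")
    write_sine_wav(demo_path)
    info = wav_info(demo_path)

print(info)
```
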

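IV. STT (speech to text): convert speech to text

Goal: correctly transcribe recordings of the same text spoken at different lengths and with different accents.

Install: pip install SpeechRecognition
README: http://github.com/Uberi/speech_recognition
Import: import speech_recognition as sr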

import speech_recognition as sr
print(sr.__version__)

3.8.1

r = sr.Recognizer()
r

<speech_recognition.Recognizer at 0x1fab1004490>

harvard = sr.AudioFile('D:/My life/music/some music/sodagreen/take_me_away.wav')
with harvard as source:
    audio = r.record(source)
    
audio
type(audio)
r.recognize_google(audio, language = 'zh-tw')  # Traditional Chinese
# Example: English (the default language)

type(audio)
r.recognize_google(audio)
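
`recognize_google` calls an online service, so it can fail either because the speech was unintelligible or because the request itself did not go through. The sketch below wraps the same calls with the two exceptions the SpeechRecognition library defines for these cases. The helper name `transcribe_wav` and the choice to return `None` for unintelligible audio are assumptions made for this example.

```python
def transcribe_wav(path, language="en-US"):
    """Transcribe a WAV file, returning None if the speech is unintelligible.

    `transcribe_wav` is a hypothetical helper written for this example.
    """
    # Imported lazily: requires `pip install SpeechRecognition`.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)   # read the whole file

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The service could not make out any words.
        return None
    except sr.RequestError as exc:
        # Network problem or the API was unreachable.
        raise RuntimeError(f"Speech API request failed: {exc}") from exc


# Example call (uncomment to run; the path is a placeholder):
# print(transcribe_wav("some_recording.wav", language="zh-TW"))
```
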

V. Microphone speech to text (sr.Microphone requires PyAudio: pip install pyaudio)

import speech_recognition as sr
r = sr.Recognizer()
mic = sr.Microphone()
sr.Microphone.list_microphone_names()
['Microsoft Sound Mapper - Input',
 '麦克风阵列 (Realtek(R) Audio)',
 'Microsoft Sound Mapper - Output',
 '扬声器 (Realtek(R) Audio)',
 '主声音捕获驱动程序',
 '麦克风阵列 (Realtek(R) Audio)',
 '主声音驱动程序',
 '扬声器 (Realtek(R) Audio)',
 '扬声器 (Realtek(R) Audio)',
 '麦克风阵列 (Realtek(R) Audio)',
 '麦克风阵列 (Realtek HD Audio Mic input)',
 '立体声混音 (Realtek HD Audio Stereo input)',
 'Speakers (Realtek HD Audio output)']
with mic as source:
    audio = r.listen(source)
    
r.recognize_google(audio)
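
In a noisy room `listen` may never detect the end of a phrase. A common refinement, sketched below, calibrates the recognizer against ambient noise first and bounds the wait with timeouts. The helper name `listen_once` and the timeout values are assumptions for this example; it needs a working microphone and PyAudio, so the call is left commented out.

```python
def listen_once(timeout=5, phrase_time_limit=10, language="en-US"):
    """Record one phrase from the default microphone and transcribe it.

    `listen_once` is a hypothetical helper; returns None when nothing usable is heard.
    """
    # Imported lazily: requires SpeechRecognition and PyAudio.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample the background for one second to set the energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        try:
            audio = recognizer.listen(source, timeout=timeout,
                                      phrase_time_limit=phrase_time_limit)
        except sr.WaitTimeoutError:
            return None   # nobody started speaking within `timeout` seconds

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return None       # speech was detected but not understood


# Example call (uncomment to run):
# print(listen_once())
```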